1. Exploratory Data Analysis (EDA)

Table of contents:

1.1 Prerequisites

Import the required libraries. Before running the notebook, it is assumed that the user has already installed the required libraries listed in requirements.txt.

Since the notebooks and utility functions are under the src folder of the repository, os.chdir("..") is used as an easy way to move back one directory and access the other folders. This serves as a workaround for the initial exploration.
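A guarded sketch of that workaround (the folder-name check is our addition, so that re-running the cell does not keep climbing the tree):

```python
import os

def move_to_repo_root(marker: str = "src") -> None:
    """Move one directory up if we are currently inside `marker`,
    so that sibling folders such as data/ stay reachable.
    Guarding on the folder name keeps the cell idempotent."""
    if os.path.basename(os.getcwd()) == marker:
        os.chdir("..")
```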

1.2 Load raw data

Modify the DATA_LOCATION directory as needed. This configuration sets it as src/data in the repository. This assumes that the required data is already located within the directory e.g.

us-traffic
│   ...
│
└───src
│   │   datasetdownloader.py
│   │   ...
│   │
│   └───data  <- (DATA_LOCATION)
│       │   dot_traffic_2015.txt.gz
│       │   dot_traffic_stations_2015.txt.gz
│       │
│       └─── ...
│
...

Alternatively, the user can also download the files via the datasetdownloader.py script under src.

There are two main files under the dataset: dot_traffic_2015.txt.gz and dot_traffic_stations_2015.txt.gz.
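Loading the two files might be sketched as below; pandas infers gzip compression from the .gz suffix (the helper name is ours):

```python
import os
import pandas as pd

def load_raw(data_location: str):
    """Read both gzipped, comma-separated DOT files from DATA_LOCATION."""
    traffic_data = pd.read_csv(
        os.path.join(data_location, "dot_traffic_2015.txt.gz"))
    traffic_stations = pd.read_csv(
        os.path.join(data_location, "dot_traffic_stations_2015.txt.gz"))
    return traffic_data, traffic_stations
```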

1.3 Explore available columns

Upon checking the initial entries for both dataframes, we can see that traffic_data mainly contains historical information regarding traffic volume, while traffic_stations contains descriptions of the stations that collected the traffic volume information.

1.3.1 Common columns

By checking the columns, we can see that there are some common columns for both traffic_data and traffic_stations. We can use the common columns to match relevant features (e.g. spatial information) from traffic_stations to traffic_data entries when we do preprocessing for model input later on.

While it's possible to match the rows through station_id alone, this relies on the assumption that there is a one-to-one correspondence between a specific location (i.e. a set of coordinates) and a single station_id.

This assumption is shown to be false below.

As seen in the sub dataframe above, the station_id "000302" is not unique to a specific location and appears across different longitude, latitude, and FIPS codes.

Therefore, to get the most appropriate spatial features for a specific row in traffic_data, we need to make use of the other common columns for matching.
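A sketch of such a multi-key merge (the helper name is ours; in practice the key list would be taken from the intersection of the two dataframes' columns):

```python
import pandas as pd

def attach_station_features(traffic_data: pd.DataFrame,
                            traffic_stations: pd.DataFrame,
                            keys: list) -> pd.DataFrame:
    """Left-join station attributes onto traffic rows using several shared
    key columns instead of station_id alone, so station IDs duplicated
    across locations resolve to the correct station row."""
    return traffic_data.merge(traffic_stations, on=keys, how="left")
```

For example, keys=["station_id", "fips_state_code"] disambiguates a station_id such as "000302" that appears in more than one state.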

1.3.2 Correspondence of named columns

Since there are multiple columns in both traffic_data and traffic_stations that carry the same information as both a numerical encoding and a named equivalent (e.g. direction_of_travel and direction_of_travel_name), we can check the assignments and validate whether each numerical encoding is unique to a single value.

It can be seen that there are no repeated entries except for "Other lanes" in lane_of_travel_name. This is still valid, however, since the multiple encodings map only to "Other lanes" and not to any of the other travel lane names.

Therefore, we can opt to just retrieve the columns with "_name" and perform encoding before modeling.
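The validation above can be sketched as a small helper (the name and example columns are ours):

```python
import pandas as pd

def code_name_pairs(df: pd.DataFrame, code_col: str, name_col: str) -> pd.DataFrame:
    """Return the distinct (code, name) pairs, raising if one code maps to
    several names. Several codes sharing one name (the "Other lanes" case)
    is still acceptable."""
    pairs = df[[code_col, name_col]].drop_duplicates().sort_values(code_col)
    assert pairs[code_col].is_unique, f"{code_col} maps to multiple names"
    return pairs.reset_index(drop=True)
```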

1.3.3 Define state names for available FIPS codes

The numbers under fips_state_code correspond to US state names. External information from US government sites was retrieved to further understand the FIPS data present in the dataset. We can check the states along with the number of entries under traffic_data.

We retrieve and process the following dataframes for reference:

1.3.4 Check for null values and incomplete data in traffic data

Since restrictions is entirely NaN and is not a common column between traffic_data and traffic_stations, we can opt to drop/disregard it.

Check a sample state to see if its stations collect data for all the dates in the given year of 2015.

From this we can see that there are multiple entries per station in a given state, due to the direction of travel recorded in the data. Since we are only interested in the collective traffic volume collected by a station at a specific location for a given time, we can compress the traffic volume later on by taking the sum.

For now, we can retrieve the unique dates regardless of direction of travel to get the dates on which each station collected data in a given state. This is to see if there are gaps in the data. Since a one-day gap results in 24 lost data points, it would be hard to fill the gap with a linear/average approximation due to the inherent seasonality of the data. However, filling the gaps would be feasible if the forecasting is done at a daily level.

We can then adjust the approach when transforming the data as model input.
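One way to sketch the gap check, assuming we have a station's entry dates in hand (the helper name is ours):

```python
import pandas as pd

def missing_dates(dates, year: int = 2015) -> pd.DatetimeIndex:
    """Calendar days in `year` with no entries for a station, regardless of
    the direction of travel. Each returned day implies 24 lost hourly points."""
    observed = pd.DatetimeIndex(
        pd.to_datetime(pd.Series(dates)).dt.normalize().unique())
    full = pd.date_range(f"{year}-01-01", f"{year}-12-31", freq="D")
    return full.difference(observed)
```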

1.3.5 Check for null values and incorrect data in traffic stations

If we check a portion of traffic_stations, there are a lot of null values. However, since we are mainly interested in analyzing traffic volume, data from traffic_stations is mostly used to supplement our findings and add features for modeling. By checking the spatial columns, we narrow down the features we want to explore for now and note the other columns for future exploration.

Look for the row with the null longitude and latitude entry.

Correct the longitude and latitude.

We assume that 0 entries throughout the dataset are normal, since certain areas may have lower traffic compared to urban or high-density locations.

At first glance, we can already see that there are anomalous entries in the dataset, since there are entries with 0 longitude and latitude and even a ~990 entry for longitude. For reference, a quick Google search shows that the coordinates of the US are approximately 37.0902° N, 95.7129° W.

Visually, below, we can see that there are some stations located outside of the US. We can also see that in some parts stations are sparse, which could also contribute to the lower daily traffic volume data collected.

From the plots below, we can see that some entries for the stations are located outside of the US. There seems to be a common pattern wherein the value is off by a factor of 10 (e.g. 987 should be 98.7 relative to the coordinates of the US).

Correct zero columns by retrieving the average longitude and latitude to supplement rows with no values.

Correct the incorrect offset for the longitude and latitude entries.
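Both corrections could be sketched as below; the magnitude thresholds (90 for latitude, 180 for longitude) and the helper name are our assumptions:

```python
import pandas as pd

def correct_coordinates(stations: pd.DataFrame) -> pd.DataFrame:
    """Rescale magnitude-shifted entries (e.g. 987 -> 98.7), then fill zero
    coordinates with the mean of the remaining non-zero values."""
    df = stations.copy()
    for col, bound in [("latitude", 90), ("longitude", 180)]:
        # entries beyond a plausible coordinate magnitude are off by 10x
        off_scale = df[col].abs() > bound
        df.loc[off_scale, col] = df.loc[off_scale, col] / 10
        # supplement zero entries with the average of the corrected values
        nonzero_mean = df.loc[df[col] != 0, col].mean()
        df.loc[df[col] == 0, col] = nonzero_mean
    return df
```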

We can see here that while most of the station IDs are properly inside of South Dakota, there are still some misplaced stations. We can click on the circles on the map to check the details of these stations. One of the stations, "000014", is located somewhere in Montreal. Since we are unsure of the proper offset for this set of values, we can apply thresholding based on the coordinates in fips_loc_df later on to disregard entries for these stations due to the unreliable nature of the data.

1.4 Temporal analysis

Convert date column to datetime.

Since there are hourly entries from January 1, 2015 until December 31, 2015, the scope of the dataset is the entirety of 2015. We can set aside the months of November and December for the test set if we are to do forecasting.

Get sum across the hourly traffic volume.

Check behavior across all states.

1.4.1 Daily and hourly traffic volume patterns

Check the date/entry with the sudden dip.

Retrieve the hourly rows and transform them into a 1D array.
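A sketch of that transformation, assuming the 24 hourly-volume columns share a common prefix (the prefix and helper name are ours):

```python
import numpy as np
import pandas as pd

def hourly_to_1d(df: pd.DataFrame, prefix: str = "traffic_volume") -> np.ndarray:
    """Flatten the per-hour volume columns of each row into one long 1D
    array (row-major, so consecutive hours and days stay consecutive)."""
    hourly_cols = sorted(c for c in df.columns if c.startswith(prefix))
    return df[hourly_cols].to_numpy().ravel()
```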

Animation code is based on the code here

Uncomment to retrieve the sample animation. Based on this

Graphs per hour for each day in January 2015 are superimposed. It can be seen that there appear to be common trends per hour.

During different parts of the day, the trend in traffic volume differs. For example, from midnight to early morning (0-5AM), traffic is low since most human activities, such as regular office hours and school, occur from morning to afternoon. This is further verified by the spike in traffic volume around midmorning (7-9AM), when public transportation such as buses/taxis as well as private vehicles are used to travel to schools, business establishments, offices, and elsewhere.

After stable traffic volumes during the early afternoon (11AM-3PM), there are sudden spikes as people most likely return home after their time outside, with traffic slowly winding down into the night.

Because of this, we can add parts of day as a feature for our models when forecasting the data.
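One way to sketch such a feature; the bucket boundaries are our assumptions based on the trends discussed above:

```python
def part_of_day(hour: int) -> str:
    """Bucket an hour of day (0-23) into coarse periods for use as a
    categorical model feature. Boundaries are assumed, not from the data."""
    if hour < 5:
        return "late_night"   # 0-5AM: low activity
    if hour < 11:
        return "morning"      # covers the 7-9AM commute spike
    if hour < 16:
        return "afternoon"    # relatively stable 11AM-3PM volumes
    if hour < 20:
        return "evening"      # homeward spike
    return "night"            # winding down
```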

Since the day of week values in the dataset are encoded numerically, we can check the named equivalent for the day of week as shown below.

From the plot below we can see that the traffic volume for weekends (Sunday and Saturday) are lower compared to their weekday counterparts. Additionally, there seems to be a slightly higher volume during Friday.

We can further check the behavior of the traffic volume per hour for each day of the week. We can see here that there is a significantly lower traffic volume during early mornings on the weekend. This may be attributed to activities such as school and regular office hours only occurring on weekdays, thus lowering the traffic volume during weekends.

1.4.4 Statistical tests

Use the Augmented Dickey-Fuller test to check the data for stationarity. If the p-value is relatively low, we can opt not to transform the data further prior to modeling.

Modify the parameters below as needed for analysis.

The following plots check the time series decomposition of the hourly traffic volume collected from the given station in the state across a daily period. For hourly_sub_aggr, the entries are transformed to be hourly instead of daily.

Based on the default values in this notebook, hourly_sub_aggr was collected from station "930198" in the state of Florida (FIPS state code 12).

The following shows the time series decomposition of the daily total traffic volume to check for trends on a weekly interval. Since vol_aggr has daily entries, the period is set to 7.

1.5 Spatial Analysis

1.5.1 Module for visualizing stations per state

Create a module for quick checking of station plots across states.

1.5.2 Average traffic volume per state

There are many possible reasons for high daily traffic volume: the population density of the state, the number and placement of stations collecting data, the number of roads or major highways providing interstate travel, establishments for leisure and necessities, as well as offices and schools in an area, among many others.

Check correlation between the number of stations in a given state and the average daily traffic volume.
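A sketch of that check, assuming a precomputed daily-volume column and the fips_state_code/station_id columns as grouping keys (the column and helper names are ours):

```python
import pandas as pd

def stations_vs_volume(traffic_data: pd.DataFrame,
                       traffic_stations: pd.DataFrame,
                       volume_col: str = "daily_volume") -> float:
    """Pearson correlation between the number of stations per state and
    the state's average daily traffic volume."""
    counts = traffic_stations.groupby("fips_state_code")["station_id"].nunique()
    volume = traffic_data.groupby("fips_state_code")[volume_col].mean()
    # .corr aligns the two series on their shared fips_state_code index
    return counts.corr(volume)
```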

1.5.3 Explore relation between average daily traffic volume and coordinate values

Check highest daily traffic volume across the dataset.

Check the general trend for urban and rural areas.

We can see here that, compared to rural traffic volumes, urban areas typically have higher average traffic volume across all states. However, some values such as "Rural: Principal Arterial - Interstate" have a higher average daily traffic volume, though still much lower compared to the urban counterpart "Urban: Principal Arterial - Interstate".

1.6 EDA Summary

During our initial analysis, we can see that the data contains hourly traffic volume entries collected by stations for each state in the US for the entire year of 2015. Per state, there are counties and more spatial information such as urban vs. rural classification, longitude, and latitude values. Upon checking the data, there are gaps between daily entries, but no null values were found for the hourly entries. While there are 0 and negative values, we assume that the sensors for each station are properly calibrated and leave those values untouched, as they may be intentional per station. We also assume that while stations may have different sensors, the traffic volume entries are normalized to the same unit in the dataset.

While checking the traffic stations, it was seen that some entries followed common patterns of incorrect longitude and latitude data. This was verified by collecting external data and matching the FIPS state codes to their approximate longitude and latitude. These values were corrected with an offset, and visualization showed that the corrected values are more appropriate given their state location. However, some stations still had anomalous values. Since these cannot be corrected by a simple multiplier, it is noted that these stations can be left out during visualization or forecasting due to the unreliable nature of the data. Other incorrect data points were corrected during analysis, and some columns were left out since they mostly contained information about the stations and often had NaN values; these columns are noted for future exploration.

Afterwards, we checked the generalized temporal behavior of the data and saw trends regarding hour of day, part of day, and day of week. Statistical tests were also applied: the hourly traffic volume time series has a low p-value, which indicates that it is stationary and does not need further pre-processing. However, the p-value for the daily traffic volume series is relatively high, so forecasting daily traffic volume may need pre-processing to adjust these values (e.g. via a log transform).

Lastly, spatial analysis was done by comparing the traffic trends and the number of stations that collected data per state. Traffic volume per state and possible correlation with longitude and latitude values were also explored. It was also confirmed that urban areas have higher daily traffic volume compared to rural areas, which may be used as a feature for modeling.

After our EDA, we can move onto feature engineering and data pre-processing to prepare our time series data for forecasting.